Predicting Colonization Level (Log10 CFU) with Day 0 Community

Modeling Colonization level (Top 10%)

Model built using all OTUs present in >10% of samples, only displaying top 10% of predictive features
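The two filtering steps named above (a prevalence cutoff, then keeping the top fraction of features by importance) can be sketched as follows. This is a minimal illustration, not the code used for the report: the function names, the toy counts, the toy importance scores, and the choice to round the top 10% up are all assumptions.

```python
import math

def prevalence_filter(counts, min_fraction=0.10):
    """Keep OTUs detected (count > 0) in more than min_fraction of samples."""
    kept = {}
    for otu, per_sample in counts.items():
        present = sum(1 for c in per_sample if c > 0)
        if present / len(per_sample) > min_fraction:
            kept[otu] = per_sample
    return kept

def top_fraction(importance, fraction=0.10):
    """Return the highest-scoring features, keeping ceil(fraction * n) of them."""
    n_keep = math.ceil(fraction * len(importance))
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:n_keep]

# Toy table: 10 samples, made-up counts and hypothetical OTU names.
toy_counts = {
    "OtuA": [5, 0, 3, 2, 0, 1, 4, 0, 2, 1],  # present in 7 of 10 samples
    "OtuB": [0, 0, 0, 0, 0, 0, 0, 0, 0, 9],  # present in exactly 10%, dropped
    "OtuC": [0, 2, 0, 0, 0, 0, 0, 0, 0, 1],  # present in 2 of 10 samples
}
filtered = prevalence_filter(toy_counts)

# Made-up importance scores for the surviving OTUs.
toy_importance = {"OtuA": 0.9, "OtuC": 0.1}
displayed = top_fraction(toy_importance)
print(sorted(filtered), displayed)
```

The strict `>` comparison follows the ">10% of samples" wording, so an OTU present in exactly 10% of samples is excluded.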

The % Variation explained in the randomized model is:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.4787, -0.2882, -0.1632, -0.1622, -0.05705, 0.3652
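
Negative values in this summary are expected for a model with no real signal. Assuming the % variation explained is the pseudo R-squared commonly reported for random forest regression, 1 - MSE / Var(y) (the report does not state the formula, so this is an assumption), any model that predicts worse than the mean of the observed values scores below zero. A small sketch with made-up numbers:

```python
def pct_var_explained(observed, predicted):
    """Pseudo R-squared: 1 - MSE / Var(y), with Var(y) using divisor n."""
    n = len(observed)
    mean_y = sum(observed) / n
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n
    var_y = sum((o - mean_y) ** 2 for o in observed) / n
    return 1.0 - mse / var_y

# Hypothetical log10 CFU values, not taken from the data set.
observed = [5.0, 6.0, 7.0, 8.0]
mean_only = [6.5] * 4            # always predict the observed mean
shuffled = [8.0, 5.0, 7.0, 6.0]  # predictions unrelated to the truth

baseline = pct_var_explained(observed, mean_only)
unrelated = pct_var_explained(observed, shuffled)
print(baseline, unrelated)
```

Predicting the mean gives exactly zero, and the unrelated predictions give a negative value, which is the pattern seen for most of the randomized models.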

OTUs predictive of Colonization level (Top 10%)

Red OTUs are top 12 which are plotted against CFU below

Correlation of OTU and CFU (top 12 OTUs predictive of Colonization level)

Points colored by human donor

Are these OTUs identifying susceptibility to C. difficile, or are they only correlated with C. difficile because they are unique to the donor, i.e. is the OTU predicting the cage grouping?


Predicting Human Donor with Day 0 Community

Is there overlap in OTUs predictive of CFU and Donor?

Modeling of Human Donor (Top 10%)

Model built using all OTUs present in >10% of samples, only displaying top 10% of predictive features

##           Reference
## Prediction DA00369 DA00430 DA00581 DA00953 DA01146 DA01324 DA10027 DA10093 DA10148
##    DA00369       4       0       0       0       0       0       0       0       0
##    DA00430       0       4       0       0       0       0       0       0       0
##    DA00581       0       0       8       0       0       0       0       0       0
##    DA00953       0       0       0       3       0       0       0       0       0
##    DA01146       0       0       0       0       4       0       0       0       0
##    DA01324       0       0       0       0       0       2       0       0       0
##    DA10027       0       0       0       0       0       0       3       0       0
##    DA10093       0       0       0       0       0       0       0       2       0
##    DA10148       0       0       0       0       0       0       0       0       3

Models predicting randomized donor perform with a high error rate:
Mean Error Rate = 0.8782 with an IQR of 0.8485 - 0.9167.
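
As a rough check on that figure, the chance-level error can be estimated from the class sizes alone. The sketch below reads the per-donor sample counts off the diagonal of the confusion matrix above and assumes a classifier that guesses donors in proportion to their frequency. That guessing model is an assumption for illustration, not the randomization procedure actually used in the report.

```python
# Samples per donor, read from the diagonal of the confusion matrix.
donor_counts = {
    "DA00369": 4, "DA00430": 4, "DA00581": 8, "DA00953": 3, "DA01146": 4,
    "DA01324": 2, "DA10027": 3, "DA10093": 2, "DA10148": 3,
}

def chance_error_rate(class_counts):
    """Expected error when guesses follow the class frequencies."""
    total = sum(class_counts.values())
    expected_accuracy = sum((n / total) ** 2 for n in class_counts.values())
    return 1.0 - expected_accuracy

def observed_error_rate(correct, total):
    """Fraction of samples that were misclassified."""
    return 1.0 - correct / total

n_samples = sum(donor_counts.values())              # 33 samples in total
chance = chance_error_rate(donor_counts)
actual = observed_error_rate(n_samples, n_samples)  # all counts on the diagonal
print(round(chance, 3), actual)
```

Under these assumptions the chance-level error is about 0.865, in the same range as the reported mean of 0.8782 for the randomized models, while the matrix above contains no misclassified samples.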

OTUs predictive of Human Donor (Top 10%)

Red OTUs are top 12 which are plotted against CFU below

Correlation of OTU and CFU (top 12 OTUs predictive of Human Donor)

Points colored by human donor

There are 3 OTUs that overlap between predicting Colonization and Source:
Otu000014, Otu000005, Otu000283
It is possible these OTUs are selected in predicting Colonization because they are helping to separate the groups of mice by cage. It is interesting, though, that OTU 4 and OTU 14 were previously identified as being correlated (rho = 0.7586887), but only OTU 14 is predictive of Donor. Otu000004 is in the top of OTUs predictive of Human Donor with a relative MDA of 0.3047336.

As for OTU 283 and 5, those are correlated (rho = 0.8235480) and they are also correlated to Otu000171 (0.7891608), Otu000196 (0.8190665), Otu000202 (0.8147822), Otu000255 (0.8230815), Otu000262 (0.8141814), Otu000283 (0.8268792), Otu000394 (0.8321597), Otu000403 (0.8258554), which are all in the top 10% of OTUs predictive of Colonization.
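
For reference, a rank correlation between two OTUs can be computed as in the sketch below. It assumes the rho values quoted here are Spearman coefficients (the report does not say which coefficient was used), and the abundance vectors are made up for illustration.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical relative abundances for two OTUs across six samples.
otu_a = [0.00, 0.01, 0.03, 0.03, 0.10, 0.20]
otu_b = [0.02, 0.00, 0.05, 0.04, 0.30, 0.25]
rho = spearman_rho(otu_a, otu_b)
print(rho)
```

Because the coefficient depends only on ranks, OTUs that rise and fall together across donors score highly even when their abundances differ by orders of magnitude.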

Do correlated OTUs matter?


Model Colonization level with Human Donor included

Model built using all OTUs present in >10% of samples, only displaying top 10% of predictive features

The % Variation explained in the randomized model is:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.4787, -0.2882, -0.1632, -0.1622, -0.05705, 0.3652

OTUs predictive of Colonization level with Source included (Top 10%)

Top 12 OTUs in red are plotted against CFU below

There are 18 OTUs that overlap between predicting Colonization and Source:
Otu000403, Otu000019, Otu000171, Otu000014, Otu000394, Otu000004, Otu000005, Otu000202, Otu000255, Otu000160, Otu000021, Otu000087, Otu000176, Otu000196, Otu000283, Otu000151, Otu000001, Otu000213
With Human Source included, the model retains all the OTUs predictive of Colonization and adds Human Source as the top predictor. The relative importances of some OTUs shift, which could be due to the randomness of the RF algorithm or to the effect of adding human source; the latter may imply that part of their predictive power is attributable to human donor dependency. OTU 357 is added, but it sits just below the 10% cutoff in the model without human source. Thus there appears to be no meaningful difference in predictive OTUs in the presence or absence of human donor data.

There are 3 OTUs that overlap between predicting Colonization and Source:
Otu000014, Otu000005, Otu000283
The OTUs predictive of Colonization when human source is included differ from the OTUs predictive of Human Donor to a similar degree. The same OTUs are shared, which is not surprising since both models predicting Colonization use the same OTUs. Aside from those shared OTUs, and even the correlated OTUs, we still have a much different set of predictors, which may reveal community members associated with C. difficile colonization.



Modeling CFU with One Mouse per Human Donor

In an effort to remove any potential donor community dependency, the model is trained on a single mouse per human donor source and then tested on all the remaining mice.
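
The split described here can be sketched as below. This is a minimal illustration under stated assumptions: the mouse IDs are hypothetical, the donor IDs are borrowed from the report only as labels, and drawing the training mouse uniformly at random within each donor is an assumption about how the sampling was done.

```python
import random

def one_per_group_split(sample_groups, rng):
    """Pick one sample per group for training; all others form the test set."""
    by_group = {}
    for sample, group in sample_groups.items():
        by_group.setdefault(group, []).append(sample)
    # Sort members so the draw depends only on the seed, not on dict order.
    train = [rng.choice(sorted(members)) for members in by_group.values()]
    test = [s for s in sample_groups if s not in train]
    return train, test

# Hypothetical mouse IDs mapped to donor IDs.
mice = {
    "m1": "DA00369", "m2": "DA00369",
    "m3": "DA00430", "m4": "DA00430",
    "m5": "DA00581", "m6": "DA00581", "m7": "DA00581",
}
rng = random.Random(0)  # fixed seed so the sketch is repeatable
train, test = one_per_group_split(mice, rng)
print(train, test)
```

Repeating the draw with different seeds gives one Rsq value per iteration, which is how a distribution of training and test Rsq values can be built up.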

With all OTUs

When using all OTUs the model performs poorly. The median Rsq of the training set was -0.1447438, which is likely due to the features. This is a summary of the importance scores for all predictors:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
-3.981, -0.603, 0, -0.03323, 0, 7.912, 203
75.33% of OTUs have an importance of 0 or less. The median Rsq of the test set was 0.4195982 and the Rsq of all predicted values was 0.4417457.

This one-mouse-per-source model is improved by reducing the number of features with negative importance, either by using the top X% of predictors from the model of all mice or by taking only the features that more often have an importance > 0 (OTUs: Otu000019, Otu000086, Otu000048, Otu000015, Otu000001, Otu000030, Otu000018, Otu000027, Otu000088, Otu000106, Otu000016, Otu000080, Otu000110, Otu000114, Otu000011, Otu000087, Otu000160, Otu000006, Otu000003, Otu000002, Otu000049, Otu000032).

One Mouse per Human Donor with OTUs > 0 give:
* median training set rsq of 0.08652421
* median testing set rsq of 0.5331836
* overall rsq of 0.5534567

One Mouse per Human Donor with the top 10% predictive OTUs when using all mice
* median training set rsq of 0.07336261
* median testing set rsq of 0.6070359
* overall rsq of 0.633238

The % var explained values from the individual iterations appear to have a bimodal distribution.

Is this bimodal distribution due to:
* Small n?
* Incomplete source dependency?
* Ratio of train to test?

Using random forest to identify the mice driving bimodal Rsq distribution

Cages A, C, and D all have the same donor.
Is it possible that one of the cages from the same donor is not similar to the others, and is causing the poor performance?

The distribution of Rsq values when modeling by cage appears to be improved compared with modeling by human donor.

By cage:
Min., 1st Qu., Median, Mean, 3rd Qu., Max.,
0.5272, 0.6661, 0.7113, 0.7138, 0.7559, 0.8548

By donor:
Min., 1st Qu., Median, Mean, 3rd Qu., Max.,
0.2779, 0.4622, 0.5491, 0.5643, 0.6797, 0.7776

It seems the assumption that all mice from a donor have the same community may be false, because the bimodal distribution appears to depend on which mice from that group are used. However, when the analysis is run with only one cage from the donor with multiple cages (581), it performs as observed for the lower peak of the bimodal distribution, and when a second cage from the same donor is added we observe Rsq values at the higher peak. Thus there appears to be a cage effect improving the performance, and the bimodal distribution may be due to variation in the ability of these communities to predict each other, i.e. some may be more similar to each other but still, as a group, more different from the other cages.



So how does this cage/donor effect influence the overall CFU prediction?

Leave One Out - Model Colonization level with Human Donor included

Model built using all OTUs present in >10% of samples, only displaying top 10% of predictive features

Leave one Cage out:

Rsq of all predictions - 0.5594288
Rsq of each iteration - 0.86, 0.85, 0.83, 0.82, 0.79, 0.85, 0.79, 0.86, 0.83, 0.88, 0.87
(cage order = 369, 430, A, C, D, IN3, LINE, MOUT, NP2, OUT2, OUTA)

Leave one Source out:

Rsq of all predictions - -0.1834717
Rsq of each iteration - 0.86, 0.85, 0.54, 0.85, 0.79, 0.86, 0.83, 0.88, 0.87
(source order = DA00369, DA00430, DA00581, DA00953, DA01146, DA10093, DA01324, DA10148, DA10027)
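
The leave-one-group-out scheme used for both the cage and source runs can be sketched as below, along with one way to pool the held-out predictions into a single Rsq. Everything here is illustrative: the CFU values and group labels are made up, and the model is replaced by a stand-in that predicts the training mean, since a random forest is outside the scope of a short standard-library sketch.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test

def r_squared(observed, predicted):
    """1 - residual sum of squares / total sum of squares."""
    mean_y = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_y) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical log10 CFU values and group labels (not the real data).
cfu = [5.0, 5.5, 8.0, 8.2, 7.9, 8.1]
donor = ["d1", "d1", "d2", "d2", "d3", "d3"]

held_out_order = []
pooled = [None] * len(cfu)
for held_out, train, test in leave_one_group_out(donor):
    held_out_order.append(held_out)
    # Stand-in model: predict the training mean. A random forest fit on
    # the training rows would replace this line in a real analysis.
    prediction = sum(cfu[i] for i in train) / len(train)
    for i in test:
        pooled[i] = prediction

pooled_rsq = r_squared(cfu, pooled)
print(held_out_order, pooled_rsq)
```

With a stand-in that cannot extrapolate to an unseen group, the pooled Rsq is negative. A similar effect may be at work when the only low-CFU source is held out and its range is absent from the training data.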

Removing the repeated cages of DA00581 (A and C) causes the model to perform poorly when leaving one out.
The performance of the whole model appears to depend on two groupings of points (one around ~5-6 log (all Donor 581) and one around ~8 log).
How well can random forest model the data without the DA00581 source?

Leaving one cage/source out (without Donor 581):

Rsq of all predictions - -0.1864957
Rsq of each iteration - 0.59, 0.46, 0.59, 0.12, 0.6, 0.49, 0.74, 0.61
(cage/source order = DA00369, DA00430, DA00953, DA01146, DA10093, DA01324, DA10148, DA10027)

While this is removing 3 cages (8 mice) and decreasing the n, it is removing the smaller, lower grouping of points. So how well can the whole model predict CFU without the DA00581 samples?

Model Colonization level with Human Donor included (without Donor 581):

The % Variation explained in the randomized model is:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.6202, -0.2761, -0.1741, -0.1574, -0.04562, 0.3898

Including one Donor 581 cage increases performance:
w/Cage A: 62.9% Var Explained
w/Cage C: 49.59% Var Explained
w/Cage D: 67.85% Var Explained

Without human donor, performance was slightly lower with similar top features

This decrease in performance when removing Human Donor 581 suggests that with this data set, random forest is unable to predict CFU with a high degree of accuracy. This model seems to potentially be identifying the split between the high and low CFU groups, although this is difficult to say with any weight since we only have one source in the low grouping.

Remove High and Low CFU sources

The % Variation explained in the randomized model is:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.5345, -0.325, -0.2058, -0.1824, -0.08687, 0.4237



Regression cannot be used to predict colonization levels in this dataset

Since the % Variation explained by the RF model decreases to ~0% when removing the extreme values, this suggests that we do not have the data to make accurate predictions of the colonization level, due to the low number of samples at the extremes. However, we do seem to have a separation between the high and low groups, but we cannot model the middle levels with regression. So another way we could look at this would be to classify the groups.

Furthermore, the variation in the steady state CFU seems to have a high degree of overlap in this middle grouping of colonization level, thus it does not appear possible to precisely define a CFU level to predict relative to the others.

NMDS of colonization level
## Run 0 stress 0.1450506 
## Run 1 stress 0.1588568 
## Run 2 stress 0.1500734 
## Run 3 stress 0.1553419 
## Run 4 stress 0.1796939 
## Run 5 stress 0.1551527 
## Run 6 stress 0.1640161 
## Run 7 stress 0.1576915 
## Run 8 stress 0.1500734 
## Run 9 stress 0.1555623 
## Run 10 stress 0.144962 
## ... New best solution
## ... Procrustes: rmse 0.02333755  max resid 0.1096514 
## Run 11 stress 0.1553422 
## Run 12 stress 0.144962 
## ... Procrustes: rmse 1.144152e-05  max resid 2.126784e-05 
## ... Similar to previous best
## Run 13 stress 0.1724275 
## Run 14 stress 0.1551531 
## Run 15 stress 0.1729041 
## Run 16 stress 0.1553419 
## Run 17 stress 0.1500743 
## Run 18 stress 0.1450508 
## ... Procrustes: rmse 0.02334761  max resid 0.1088818 
## Run 19 stress 0.1449619 
## ... New best solution
## ... Procrustes: rmse 4.958642e-05  max resid 0.0001029263 
## ... Similar to previous best
## Run 20 stress 0.1642087 
## *** Solution reached

Stress = 0.1449619

AMOVA

hi-lo-mid  Among    Within    Total
SS         4.48543  8.08716   12.5726
df         2        30        32
MS         2.24272  0.269572

Fs: 8.31955 p-value: <0.001*

hi-lo      Among    Within     Total
SS         2.4372   0.676826   3.11403
df         1        10         11
MS         2.4372   0.0676826

Fs: 36.0093 p-value: <0.001*

hi-mid     Among    Within    Total
SS         2.08215  7.41077   9.49293
df         1        23        24
MS         2.08215  0.322207

Fs: 6.46215 p-value: <0.001*

lo-mid     Among    Within    Total
SS         2.30066  8.08673   10.3874
df         1        27        28
MS         2.30066  0.299508

Fs: 7.68145 p-value: <0.001*

Significantly different centroids.
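
The Fs values follow directly from the sums of squares and degrees of freedom: each mean square is SS / df, and Fs is the among-group mean square divided by the within-group mean square. The short check below recomputes them from the AMOVA output above; the function name is made up for this sketch, and the p-values are not recomputed because they come from permutations of the underlying distance matrix.

```python
def amova_fs(ss_among, ss_within, df_among, df_within):
    """Return (MS_among, MS_within, Fs) from sums of squares and df."""
    ms_among = ss_among / df_among
    ms_within = ss_within / df_within
    return ms_among, ms_within, ms_among / ms_within

# (SS among, SS within, df among, df within) copied from the AMOVA output.
comparisons = {
    "hi-lo-mid": (4.48543, 8.08716, 2, 30),
    "hi-lo": (2.4372, 0.676826, 1, 10),
    "hi-mid": (2.08215, 7.41077, 1, 23),
    "lo-mid": (2.30066, 8.08673, 1, 27),
}
results = {name: amova_fs(*args) for name, args in comparisons.items()}
for name, (ms_a, ms_w, fs) in results.items():
    print(f"{name}: MS among={ms_a:.5f} MS within={ms_w:.6f} Fs={fs:.4f}")
```

The recomputed ratios agree with the reported Fs values to within rounding of the printed sums of squares.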

HOMOVA

HOMOVA     BValue   P-value  SSwithin/(Ni-1)_values
hi-lo-mid  21.6225  <0.001*  0.000145266  0.0966271  0.370517
hi-lo      14.1638  <0.001*  0.000145266  0.0966271
hi-mid     18.2515  <0.001*  0.000145266  0.370517
lo-mid     3.48295  <0.001*  0.0966271    0.370517

Classification of Colonization Level

Features driving classification

While these features seem to drive the grouping, this still needs to be validated and investigated further to see whether the results repeat in similar microbiomes, since the high and low groups are largely homogeneous communities and the middle group is made up of heterogeneous communities.